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Abstract 

A new methodology for model determination in decomposable graphical Gaussian models 
(Dawid and Lauritzen 1993| l is developed. The Bayesian paradigm is used and, for each given 



graph, a hyper inverse Wishart prior distribution on the covariance matrix is considered. This 
prior distribution depends on hyper-parameters. 

It is well-known that the models's posterior distribution is sensitive to the specification of 
these hyper-parameters and no completely satisfactory method is registered. In order to avoid 
this problem, we suggest adopting an empirical Bayes strategy, that is a strategy for which the 
values of the hyper-parameters are determined using the data. Typically, the hyper-parameters 
are fixed to their maximum likelihood estimations. In order to calculate these maximum 
likelihood estimations, we suggest a Markov chain Monte Carlo version of the Stochastic 
Approximation EM algorithm. 

Moreover, we introduce a new sampling scheme in the space of graphs that improves the add 
and delete proposal of Armstrong et al. ( 2009[ l. We illustrate the efficiency of this new scheme 



on simulated and real datasets. 
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1 Gaussian graphical models in a Bayesian Context 



Statistical applications in genetics, sociology, biology , etc often lead to complicated interaction 
patterns between variables. Graphical models have proved to be powerful tools to represent the 
conditional independence structure of a multivariate distribution : the nodes represent the variables 
and the absence of an edge between two vertices indicates some conditional independence between 
the associated variables. 

Our paper presents a new approach for estimating the graph structure in Gaussian graphical model. 
A very large literature deals with this issue in the Bayesian paradigm: Dawid and Lauritzen'fl993 1; 
Madigan and Raftery| ( |1994| ); [Giudici and Green] ( |1999| ); [Jones et al.| (2005) ; ^Armstrong et al. 



(2009 1; Carvalho and Scott (2009 1. For a frequentist point of view, one can see Drton and Perhnan 



(20041. 



We suggest here an empirical Bayes approach: the parameter of the prior are estimated from the 
data. Parametric empirical Bayes methods have a long history, with major developments evolv- 
ing in the sequence of papers by |Efron and Monis| ( |1971[ri972b|a[[l973a|b[[T976b|a[ ). Empirical 
Bayes estimation falls outside the Bayesian paradigm. However, it has proven to be an effective 
technique of constructing estimators that performs well under both Bayesian and frequentist cri- 
teria. Moreover, in the case of decomposable Gaussian graphical models, it gives a default and 
objective way for constructing prior distribution. The theory and applications of empirical Bayes 
methods are given by [Morris ( 1983 1. 

In this Section, we first recall some results on Gaussian graphical models, then we justify the use 
of the empirical Bayes strategy. 



1.1 Background on Gaussian graphical models 

Let Q = (V, E) be an undirected graph with vertices V = {1, . . . ,p} and set of edges 

E = {ei, . . . , et], (Vi = 1, . . . , t, ej G F X V). Using the notations of [Giudici and Green[([ 1999b, 



we first recall the definition of a decomposable graph. A graph or subgraph is said to be complete 
if all pairs of its vertices are joined by edges. Moreover, a complete subgraph that is not contained 
within another complete subgraph is called a clique. Let C = {Ci, . . . , C^} be the set of the 
cliques of an undirected graph. 

An order of the cliques (Ci, . . . , Ck) is said to beperfect if Vi = 2, . . . ,k,3h = h{i) G {1, . . . , i— 
1} such that Si = Ci D U^^^Cj C Ch- S = {S2, ■ ■ ■ , Sk} is the set of separators associated 
to the perfect order {Ci, . . . ,Ck}- An undirected graph admitting a perfect order is said to be 
decomposable. Let Vp denote the set of decomposable graphs with p vertices. For more details. 



one can refer to Dawid and Lauritzen ( 1993 1, Lauritzen ( 1996 1 (Chapters 2, 3 and 5) or Giudici 
and Green[ ( [T999] ). 

The graph drawn in Figure[T]- and used as benchmark in numerical Section [42] - is decomposable. 
Indeed, the set of cliques Ci = {1,2,3}, C2 = {2,3,5,6}, C3 = {2,4,5}, C4 = {5,6,7} 
and C5 = {6,7,8,9} with associated separators ^2 = {2,3}, ^3 = {2,5}, S4 = {5,6} and 
'S'5 = {6,7} forms a perfect order. 
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Figure 1 : Example of decomposable graph 

Note that, with p vertices, the total number of possible graphs is 2^^*'^^)/^, p{p — l)/2 being 
the number of possible edges. The total number of decomposable graphs with p vertices can be 
calculated for moderate values of p. For instance, if p = 6, among the 32 768 possible graphs, 
18 154 are decomposable (around 55%); if p = 8, then 30 888 596 of the 268 435 456 possible 
graphs are decomposable (around 12%). 

A pair {A, B) of subsets of the vertex set V of an undirected graph Q is said to form a decompo- 
sition of ^ if (1) y = A U i? , (2) A n -B is complete and {?>) Ar\B separates A from B ie any 
path from a vertex in A to a vertex in B goes through Ar\ B. 

To each vertex t> G F, we associate a random variable y^. For A ^^V ,yA denotes the collection 
of random variables {y^ : v ^ A}. To simplify the notation, we set y = yy. The probability 
distribution of y is said to be Markov with respect to Q, if for any decomposition (A, B) of Q, y^ 
is independent of y^ given yAnfi- A graphical model is a family of distributions on y verifying 
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the Markov property with respect to a graph. 

A Gaussian graphical model, also called covariance selection model (see [Dempster (1972l), is 
such that 

y|g,Sg~AAp(/2,Sg) , (1) 

where Np (//, Eg) denotes the p-variate Gaussian distribution with expectation and pxp 

symmetric definite positive covariance matrix Eg. Eg has to ensure the Markov property with 
respect to Q. In the Gaussian case, y is Markov with respect to ^ = {V, E) if and only if 

where denotes the inverse of the matrix A. Eg^ is called the concentration matrix. 
In the following, we suppose that we observe a sample Y = (y^, . . . , y") from model ([T]) with 
mean parameter /x set to zero. The data are expressed as a deviation from the sample mean. This 
centering strategy is standard in the literature, however the technique developed here can be easily 
extended to the case /i 7^ Op. 

The density of Y is a function of multivariate Gaussian densities on the cliques and separators 
of Q. More precisely, let C and S denote respectively the sets of the cliques and separators of Q 
corresponding to a perfect order for Q. We have : 



^^'^"'^''^ = n|n,,,^,.,(y^|(Eg).) 



(2) 



where for every subset of vertices ^, |^| denotes its cardinal and (Eg)^^ is the restriction of (Eg) 
to A i.e. {{^g)i,j}i^Aj&A ^^'^ ~ {yj)j&A- 4>q ("1^) is the g-variate Gaussian density with 
mean Og and q x q symmetric definite positive covariance matrix A. 
From a Bayesian perspective, we are interested in the posterior probabilities 

7r(g|Y) « 7r{g) I /(Y|Eg, g)7r(Eg|g)(iEg , (3) 

for specific priors 7r(Eg|^) and tt{Q). In the following, we discuss the choice of these prior 
distributions. 

1.2 Prior distributions specification 

Prior and posterior distributions for the covariance matrix 

Conditionally on ^, we set an Hyper- Inverse Wishart (HIW) distribution as prior distribution on 

Eg: 

Eg|g,5,$ -HlWg (5,$) , 
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where 5 > is the degree of freedom and ^ is a p x p symmetric positive definite location 
matrix. This distribution is the unique hyper-Markov distribution such that, for every chque C G 
C, ~ IW{6, ^c) with density 



7r((Se)c|<5,<i>c) = /i^^(5,$c)[det(Se) 



exp{-itr[(I]e)^i<i>c]} 



(4) 



where {6, ^c) is the normaUzing constant: 



det ('%^) 



(\C\+S-l)/2 



^\C\ [ 2 



(5) 



where det(-) and tr(-) are respectively the determinant and trace and r„ is the muhivariate F- 
f unction with parameter v: 



j=i 



The full joint density is: 



7r{T,g\G, S, t 



Ucec^ii^G)c\d,^c) 



(6) 



Conditionally on Q, the HIW distribution is conjugate. The posterior distribution of is given 
by ( |Giudici||T996l ): 

Sg|Y,g,5,$~HIW(5 + n,$ + 5Y) . (7) 
where Sy = J27=i V*' *^ denoting the transpose of v. 

Moreover for such a prior distribution, the marginal likelihood for any graph ^ is a simple function 



of the HIW prior and posterior normalizing constants hg{6, ^) and hg{6 + n,^ + Sy) (Giudici 



1996 1: 



/(Y|g,5,$) 



(8) 



(27r)-"P/2/ig((5 + n,$ + 5Y) ' 

where hg{6,^) is the normalizing constant of the HIW distribution which can be computed ex- 
plicitly in decomposable graphs from the normalizing constants of the inverse Wishart cliques and 
separators densities (|4]j5]|6]l : 



hg{6,^) 



Note that Roverato ( 2002 1 extends the Hyper-Inverse Wishart distribution to non- decomposable 
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cases. Moreover, a general treatment of priors for decomposable models is given by Letac and 
MassamlpOOTll. 



Prior and posterior distributions for the graphs 



The prior distribution in the space of decomposable graphs has been widely discussed in the liter- 
ature. The naive choice is to use the standard uniform prior distribution: 

7r{Q) oc 1 . 

One great advantage of this choice is simplifying the calculus but it can be criticized. Indeed, with 
p vertices, the number of possible edges is equal to m = ^^^"^-^ and, in the case of a uniform prior 
over all graphs, the prior number of edges has its mode around m/2 which is typically too large. 
An alternative to this prior is to set a Bemouilli distribution of parameter r on the inclusion or not 
of each edge ( jJones et al.||2005t|Carvalho and Scott| |2009| ) 



TT{Q\r) 



cx r 



kg. 



1 



(9) 



where kg is the number of edges of Q. The parameter r has to be calibrate. If r = 1/2, this prior 
resumes to the uniform one. 

In the following we consider this prior distribution and give an empirical estimation of r. 

Using Q and Q, we deduce easily that the density of the posterior distribution in the space of 

decomposable graphs satisfies: 



Trig\Y,5, r,$) oc 



hg{S + n,^ + Sy 



-TT{g\r) . 



(10) 



This posterior distribution is known to be sensitive to the specification of the hyper-parameters r, 5 
and $ (see [Jones et aL| ( [20051 ); [Armstrong aL] ( |2009| )). To tackle this problem various strategies 
have been developed. In the following, we supply a short review of these methods and offer an 
alternative one. 



Choice of the hyper-parameters 6, r and $ 

In a fully Bayesian context, as proposed bylGiudici and Green[([1999l), a hierarchical prior mod- 



elling can be used. In this approach, Jand $ are considered as random quantities and a prior 
distribution is assigned to those parameters (r is fixed to 1/2). This strategy does not completely 
solve the problem since the prior distributions on 6 and <I> also depend on hyper-parameters which 
are difficult to calibrate. 



An other strategy consists in fixing the values of 6, r and <I> as in Jones et al. (2005 1. In that paper, 
r is set to encouraging sparse graphs. They choose 6 = 3 which is the minimal integer such 
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that the first moment of the prior distribution on Eg exists. Finally, they set $ = rip and using 
the fact that the mode of the marginal prior for each variance terms an is equal to r/((5 + 2), r is 
fixed to 5 + 2 if the data set is standardized. 



An intermediate strategy is suggested by Armstrong et al. (2009 1. First, they fix the value of 5 to 
4 [^assessing that such a value gives a suitably non-informative prior for Eg. Then, they consider 
different possibilities for <I>, all of the form <^ = tA where the matrix A is fixed. In all cases, for 
the hyper-parameter r, they use a uniform prior distribution on the interval [0, T] where T is very 
large. Finally, they also use a hierarchical prior on r : r ~ /3(1, 1), which leads to 



tt{G) oc 



-1 

m 



kg 



Tin 

by integration. ( ^ ) is the binomial coefficient. 



This hierarchical prior of r is also used in |Carvalho and Scott (2009 1. In that paper, they suggest 



a HIW ^(-prior approach with g = 1/n. This approach consists of fixing 5 = 1 and <I> = Sy / n. 
In our point of view, 6 measures the amount of information in the prior relative to the sample 
(see (It])): we suggest setting 6 to 1 such that the prior weight is the same as the weight of one 



observation. As pointed out by Jones et al. (2005 1, for this particular choice, the first moment of 



the prior distribution on Eg does not exist. However, for 5 = 1, the prior distribution is proper 
and we fail to see any argument in favour of the existence of a first moment. 



The structure of ^ can be discussed and various forms exist in the literature (see Armstrong et al. 
(2009) for instance). In this paper, we standardise the data and use <I> = rip. This choice leads 



to sparse graph: on average each variable has major interactions with a relatively small number 
of other variables. In that context, r plays the role of a shrinkage factor and has to be carefully 
chosen on the appropriate scale. 

In this paper, we recommend to use an empirical Bayes strategy and to fix (r, r) to its maximum 
likelihood estimation for which computation is a challenging issue. To tackle this point, a Markov 
Chain Monte Carlo (MCMC) version of the Stochastic Approximation EM (SAEM) algorithm is 
used. 

The SAEM algorithm is presented in Section|2] In Section[3} a new Metropohs-Hasting algorithm 
is introduced. Then, the proposed methodology is tested on real and simulated datasets. 



'in fact, they set 5 = 5 but they consider that fj, is unknown with uniform prior distribution: this situation corre- 
sponds to the case 5 = 4 when fj, = Op. 
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2 An empirical Bayes procedure via the SAEM-MCMC algorithm 



In the following, we set 9 = (r, r) G R*+x]0, 1[. In order to compute the maximum likelihood 
estimation of 0, we need to optimize in the following function 

I /ig(n + 5, Tip + Sy) J 

If the number of vertices is greater than 10, the number of decomposable graphs is so huge that it 
is not possible to calculate the sum over Vp. In that case, we consider the use of the Expectation- 
Maximization (EM) algorithm developed by 
Dempster et al.| ( 1977 1, noting the fact that the data Y 



(y^, . . . , y") are issued from the partial observations of the complete data (Y, Q, Sg). However, 
for such a data augmentation scheme, the E-step of the EM algorithm is not explicit and we have 
to resort to a stochastic version of the EM algorithm, like: 



the S-EM scheme introduced by |Celeux and Diebolt| ( |1992| | and |Diebolt and Celeux| ( |1993| 



where the E-step is replaced by a single simulation from the distribution of Eg) given 
Y and 0; 

2. the MC-EM or the MCMC-EM algorithms 

where the E-step is replaced by some Monte Carlo approximations 
( |McLachlan and Krishnanl|20n8] l; 



3. the SAEM algorithm introduced by Delyon et al. ( 1999 1 where the E-step is divided into a 



simulation step and a stochastic approximation step; 
4. the SAEM-MCMC algorithm 



(Kuhn and Lavielle] 2004) which extends the SAEM scheme, the "exact" simulation step 



being replaced by a simulation from an ergodic Markov chain. 

The S-EM, MC-EM and SAEM methods require to simulate a realization from the distribution 
of {Q,Tig) given Y and 6. We are not able to produce a realization exactly distributed from the 
distribution of {Q,Tjg) given Y and 9. We use the SAEM-MCMC algorithm which just requires 
some realizations from an ergodic Markov chain with stationary distribution {Q, Tig)\Y, 9. In a 
first part, we recall the EM algorithm principles and present the SAEM-MCMC scheme. In a 
second part, we detail its application to Gaussian graphical models and prove its convergence. 

2.1 The Stochastic Approximation version of the EM algorithm 

The EM algorithm is competitive when the maximization of the function 

n = IEE,,g|Y,e' {logf{Y,^g,g\9)} 
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is easier than the direct maximization of the marginal likelihood (111. The EM algorithm is a two 
steps iterative procedure. More precisely, at the k-th iteration, the E-step consists of evaluating 
Qk{G) = Q{0\ Ok-i) while the M-step updates 9k-i by maximizing Qk{0). 
For complicated models where the E-step is untractable, [Delyon et al.| ( fl999[ ) introduce the Sto- 
chastic Approximation EM algorithm (SAEM) replacing the E-step by a stochastic approximation 
of Qk{G)- At iteration k, the E-step is divided into a simulation step (S-step) of \ ^^q \ Q^^^] with 



the posterior distribution (Eg, ^)|Y, 9k-i and a stochastic approximation step (SA-step): 

Qm = {I - ik)Qk-m+ 

7fclog/(Y,4'\gW|^fc_i) 

where {'^k)k&i is a sequence of positive numbers decreasing to zero. When the joint distribution 
of (Y, Sg, Q) belongs to the exponential family, the SA-step reduces to the stochastic approxi- 
mation on the minimal exhaustive statistics. The M-step remains the same. One of the benefits 
of the SAEM algorithm is the low-level dependence on the initialization 9q, due to the stochastic 
approximation of the SA-step. 

In Gaussian graphical models, we cannot generate directly a realization from the cond itional dis- 



Kuhn and Lavielle 



(20041 suggest to 



tribution of (Sg,C/) given Y and 6k-i- For such cases, 
replace the simulation step by a MCMC scheme which consists of generating M realizations from 
an ergodic Markov chain with stationary distribution T,g,Q\Y ,9]^_i and use the last simulation 



in the SAEM algorithm. Kuhn and Lavielle (2004i prove the convergence of the estimates se- 
quence provided by this SAEM-MCMC algorithm towards a maximum of the function /(Y|^) 
under general conditions for the exponential family. 



2.2 The SAEM-MCMC algorithm on Gaussian graphical models 

In this section, we detail the application of the SAEM-MCMC algorithm to the Gaussian graphical 



model introduced in Section 1.2 More precisely, we give the expression of the complete log- 



likelihood and of the minimal sufficient statistics. Lavielle and Lebarbier ( 200 1| | applied the same 
methodology on a change-point problem. 

The complete log-UkeUhood /(Y, ^g\9) can be decomposed into three terms: 



log/(Y,g,Sg|0) = log/(Y|g,Sg) 



+ log^(Sg|g,r)+log7r(g|r). (12) 



On the right-hand side of equation ( 12 1, the first quantity is independent of 9 thus, it will not take 
part in its estimation. Using the fact that we only consider decomposable graphs and the definition 



of the Hyper Inverse Wishart distribution, the second term of the right-hand side of Equation ( 12 1 
can be developed : 
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cec 



E 



5 + 2|5| 



logdet(Eg). 



Furthermore, 



log7r(^|r) = fcglog 



1 -r 



+ mlofffl — r) . 



As a consequence, there exists ^' a function of (Y, Eg, ^, 5) independent of ^ = (r, r) such that 

log/(Y,g,Eg|r) = *(Y,Se,g,,5) 

+ ^^-^plog(T) +mlog(l - r) + 




(13) 



where (•, •) is the scalar product of M'^. Finally, following ( 13 1, the complete likelihood function 
belongs to the exponential family and the minimal sufficient statistic 5 = (5*1, 52, 5*3 ) is such 
that: 

C&C SeS 
S3{Y,g,^g) = kg. 

In an exponential model, the SA-step of the SAEM-MCMC algorithm reduces to the approxima- 
tion of the minimal sufficient statistics. Thus, we can now write the three steps of the SAEM- 
MCMC algorithm: let (7^)^^^ ^ sequence of positive numbers such that Ylk^fk = 00 and 

Efc tI < 00. 
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Algorithm 1 SAEM-MCMC algorithm 



(1) Initialize e^°\ \ 5^°^ and sf\ 

(2) At iteration k, 

• [S-Step] generate Q^''^ Tjg^^ from M iterations of a MCMC procedure - detailed in Section|3]- 
with Q, T,g\Y, 0^^~^^ as stationnary distribution. 

• [SA-Step] update (^sf'^^ using a stochastic approximation scheme: i = 1, 2, 3 



[M-Step] maximize the joint log-likelihood ( 13 1: 

^{k) ^ - 1)P + ^k) ^ £ 

(3) Set k = k + 1 and return to (2) until convergence. 



The convergence of the estimates sequence supplied by this SAEM-MCMC algorithm is ensured 



by the results of Kuhn and LavieUe (20041. Indeed, first, the complete hkehhood belongs to the 



exponential family and the regularity assumptions required by Kuhn and LavieUe ( 2004 1 (assump- 
tions M1-M5 and SAEM2) are easily verified. Secondly, the convergence requires the ergodicity 
of the Markov Chain generated at S-step towards the stationary distribution that is the di stribution 
of Q, T.q\Y , 6^ ^~^\ Finally, the properties of {jk)ken allow to apply the results of iKuhnand 



LavieUe 



(20041 and we conclude that the estimates sequence (6'^^^)fceN converges almost surely 



towards a (local) maximum of the function f{Y\9). 



3 A new Metropolis-Hastings sampler 

At each iteration k of the SAEM algorithm, a cou ple (g, Sg) has to be gene r ated under the poste- 



Giudici and Green 



(19991, 



Brooks et al. 



(20031 



rior distribution Q, T,g\Y, 6^^^^\ As described in 
and Wong et al. ( 2003| l, this simulation can be achieved using a variable dimension MCMC scheme 
like the reversible jump algorithm. In case of an HIW prior distribution on Sg, the marginal likeli- 
hood is available in closed form ([8]) and, therefore, there is no need to resort to a variable dimension 
MCMC scheme. 

At iteration k of the SAEM algorithm, the simulation of can be achieved through the 

following two steps procedure: 
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[Sl-step] ~ 7r(g|Y,e('=-i)) 



[S2-step] T.f ~ 7r(Sg|g('=), Y,^('=- 



According to (|7]), the second step [S2-step] of this procedure resolves into the simulation of HIW 
distributions the principle of which is detailed in Carvalho et al. (2007 1. 

For the first step [S 1-step], we have to resort to an MCMC algorithm but not of variable dimension 
since the chain is generated in the decomposable graphs space with p vertices. 
To sample for the posterior in the space of graphs, Armstrong et al. ( [2009) use the fact that the 
marginal likelihood is available in closed form and introduce a Metropolis-Hastings (MH) algo- 
rithm. At iteration t, their add and delete MH proposal consists of picking uniformly at random 
an edge such that the current graph with or without this edge stays decomposable; and deducing 
the proposed graph by deleting the generated edge to the current graph if it contains this edge or 
adding the generated edge otherwise. 

Let Q be the current graph, Gg the set of decomposable graphs derived from Q by removing an 
edge and Gg the set of decomposable graphs derived from Q by adding an edge. For pedagogical 
reasons, we present here an add and delete MH sampler slightly different from the one of jArm- 



strong et al. (2009 1. In our proposal, we first decide at random if we try to delete or to add an edge. 
The two schemes has exactly the same properties. Our add and delete algorithm is initialized on 
Q^^^ and the following procedure is repeated until the convergence is reached. 



Algorithm 2 Add and Delete MH proposal 
At iteration t, 

(a) Choose at random (with probability 1/2) to delete or add an edge to Q^^^^\ 

(a. 1) If delete an edge, enumerate G^^^_-^^ and generate according to the uniform distribution 
on 

(a.2) If add an edge, enumerate G^(j„i) and generate according to the uniform distribution on 

(b) Calculate the MH acceptance probability 

^p) such that 7r(^|Y, 6) is the invariant distribution of the Markov chain. 

(c) With probability p{G^^^^\Q^), accept and set Q^'^^ = Q^, otherwise reject and set 

git) = g(t-i)_ 
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The acceptance probability p{G^^ , G^) is equal to a{G^^ , Q^) A 1 where 



7r(g(*-i)|Y,5,r,$)g(gP|g(*-i)) 



with 



|G+p| 



if add 



if delete 



Note that because in general | 



ratio 



7^ the proposal distribution is not symmetric. The 



is evaluated with formula (10). 



g(t-i) and 



are not obvious and can be time-consuming. To tackle 



7r(g(t-i)|Y,<5,r,4> 

The enumerations of G 

this point, we apply the results of [Giudici and Gre^ ( |1999] ) characterizing the set of moves {add 
and delete) which preserve the decomposability of the graph. These criteria lead to a fast enumer- 
ation. 

Armstrong et al. ( 2009 1 prove that this schem^ is more efficient than the variable dimension 



sampler of Brooks et al. ( 2003 1, which is itself an improvement of the reversible jump algorithm 
proposed by Giudici and Green ( |1999| ). Their proposal is clearly irreducible and, therefore, the 
theoretical convergence of the produced Markov 
Chain towards the stationary distribution 

7r(^| Y, r) is ensured, following standard results on MH schemes. 

However, in practice, the space of decomposable graphs is so large that the chain may take quite 
some time to reach the invariant distribution. To improve this point, we introduce a data-driven 
MH kernel which uses the informations contained in the inverse of the empirical covariance ma- 
trix. To justify this choice, recall that,because of the Gaussian graphical model properties, if the 
inverse empirical covariance between vertices i and j is near zero, we can presume that there is 
no edge between vertices i and j. Then, during the MH iterations, if the current graph contains an 
edge between vertices i and j, it is legitimate to propose removing this edge. The same type of 
reasoning can be done if the absolute value of the inverse empirical covariance between vertices 
k and / is large. Indeed, in that case, and if during the MH iterations the current graph does not 
contain an edge between vertices k and /, it is legitimated to propose to add this edge. With this 
proposal, once the random choice to add or delete an edge has been done, the proposed graph is 
not chosen uniformly within the class of decomposable graphs but according to the values of the 
inverse empirical covariances. 



'In 



Armstrong et al. the step on the space of graphs represents a Gibbs step of an hybrid sampler (as already 



explained, they consider a hierarchical model where that the hyper-parameter r is a random variable). 
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Let K denote the inverse empirical covariance matrix: K = (Sv/n)^^. \ (respec- 

tively Q^^^^^U{i, j)) denotes the graph G^^^^^ where the edge {i, j) has been removed (respectively 
added). 

The Data Driven kernel is the following one : 



Algorithm 3 Data Driven MH proposal 



At iteration t, 

(a) Choose at random to delete or add an edge to Q^*~^\ 
(a. 1) If delete an edge, enumerate G^(j_i) and generate Q'^ according to the distribution such that 

1 



gp = g(*-i)\(i,j)|g(*-i) 



oc 



(a.2) If add an edge, enumerate and generate according to the distribution such that 



gp = g(t-^)uii,j)\g 



oc \K, 



(b) Calculate the MH acceptance probability 

such that 7r(g|Y, r) is the invariant distribution of the Markov chain. 

(c) With probability p{G^*^^\G^), accept G'P and set G^^^ = Qp, otherwise reject Qp and set 

git) = g(t-i)_ 



The algorithm is initialized on g^^^ and the procedure is repeated until the convergence is reached. 
Finally, in view of some numerical experiments and in order to keep the good properties in terms 
of exploration of the standard MH kernel, we propose to use in practice a combination of the 
standard add and delete MH kernel and the previously presented data-driven kernel. This point is 
detailed in the next section. 



4 Numerical experiments 

In this part, we illustrate the statistical performances of our methodology on three different data 
sets. The second one is a simulated example which highlights the convergence properties of the 



SAEM-MCMC algorithm. The first and third examples appeared in [Whittaker (1990 1 and have 



been widely used to evaluate the statistical performance of graphical models methodology, one can 
see for instance [Giudici and Green| ( [1999[ ); [Armstrong et al., ( |2009| ). Through these two examples, 
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the importance of the choice of the hyper-parameters and the efficiency of the new MCMC sampler 
are underlined. 



4.1 The Fret's heads dataset IWhittakerl (IT990ll 



Fret's heads dataset contains head measurements on the first and the second adult son in a sample 
of n = 25 families. The p = 4 variables are the head length of the first son, the head breadth of 
the first son, the head length of the second son and the head breadth of the second son. 61 graphs 
are decomposable among the 64 possibles graphs. 
We compare three different prior distributions on (T,g,Q). 



1. We first consider the prior distribution suggested by |Jones et al.| ( [2005 1 e.g. 



6=3 and r = l/{p—l) 
$ = rip with T = 6 + 2. 



2. In a second experiment, we use the prior distribution proposed in |Carvalho and Scott| ( |2009| l 
i.e 

6 = 1 (D = ^. 

n 

Furthermore, r ~ /3(1, 1) resulting into 

tt{G) oc 



m 



Q 



3. Finally, we use our prior distribution e.g, 

5=1, $ = r/p . 

and a Bernouilli prior of parameter r on the edges of Q. 

Using the SAEM algorithm described previously, we estimate r and r to 

? = 0.3925, f= 0.6052. 

On this example, there are only 61 decomposable graphs and so we are able to compute exactly 
the posterior probabilities {7r(t/|y), ^decomposable} for every prior distribution. At that point, 
we are interested in comparing the posterior probabilities of the five most probable decomposable 
graphs for the three previously prior distribution. The results are resumed in Table ??. 
The empirical Bayes estimation of r is quite smaller than the value provided by the heuristic of 



Jones et aL|(2005). As a consequence, the posterior probabiUties of graphs are really different. 



Moreover, the approach of Carvalho and Scott ( 2009 1 gives results not in agreement with one of 
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Prior 



Most probable posterior graphs and posterior probability p{G\y) 



Jones et al. 



(20051 



13 



0.24076 




0.16924 



0.11761 



CarvaUio and Scott 



(20091 



SAEM 



14 



0.30512 



13 



0.28613 




0.19979 




0.18219 



n 



0.10813 



i: 



0.1264 



Table 1: Fret's heads dataset : the three most probable posterior graphs using various prior on 

the two others method. The way the hyper-parameters r and r are considered is essential, since 
that drastically influences the results. 

4.2 Simulated Datasets 

We consider 10 artificial datasets where p = 9. These datasets are simulated according to model 
([T]l with the graph of Figure[T] r , 5 and n are set equal to 0.03, 1 and 100 respectively. 
The SAEM-MCMC algorithm has been performed on the 10 datasets in order to estimate the 
hyper-parameter r. The algorithm is arbitrary initialized with f^^) = 0.001 and f^^^ = 0.5. Given 
f^^\ Q is initialized with a standard backward procedure based on the posterior probabilities with 

f(0)_ 



The step of the stochastic approximation scheme is chosen as recommended by Kuhn and Lavielle 
(|2005): 7fc = 1 during the first iterations 1 < k < Ki, and 7^ = {k — Ki)~^ during the 



subsequent iterations. The initial guess on f(°) and r*^°^ could be far from a local maximum of 
the likelihood function and the first iterations with 7^ = 1 allow the sequence of estimates to 
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converge to a neighborhood of a local maximum. Subsequently, smaller step sizes during K — Ki 
additional iterations ensure the almost sure convergence of the algorithm to a local maximum 
of the likelihood function. We implemented the SAEM-MCMC algorithm with Ki = 100 and 
K = 300. At the S-step of the algorithm, the Markov Chain supplied by the MCMC algorithm 
is of length M = 500 during the first 5 iterations of the SAEM scheme and M = 10 for the 
remaining iterations. 

Figure|2]illustrates the convergence of the parameter estimates considering 2 arbitrary chosen data- 
sets. The estimated sequences are represented as a function of the iteration number. During the first 
iterations of SAEM, the parameter estimates fluctuate, reflecting the Markov Chain construction. 
After 100 iterations, the curves smooth but still continue to converge towards a neighborhood of a 
local maximum of the likelihood function. Convergence is obtained after 300 iterations. 
Considering the 10 datasets, fir the parameter r the relative bias is negligible and the relative root 
mean square error (RMSE) amounts to 32.10%. Note that the same study has been conducted with 
a uniform prior on Q In that case, the algorithm only involves the parameter r and the correspond- 
ing RMSE is equal to 23.5%. 




50 100 150 200 250 300 50 100 150 200 250 300 



Figure 2: Simulated datasets: evolution of the SAEM-MCMC f^^^ estimations (left) and r^^^ 
estimations (right) on 2 datasets. 
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4.3 The Fowl bones dataset IWhittakerl (|T990ll 



This dataset concerns bone measurements which are taken from n = 276 white leghorn fowl. The 
6 variables are skull length, skull breadth, humerous (wings), ulna (wings), femur (legs) and tibia 
(legs). On such a dataset, the determination of the best decomposable Gaussian graphical model 
results in finding the best graph within 18, 154 decomposable graphs (55% of the possible graphs). 
Using this example, we aim at illustrating the fact that a careful choice of the transition kernel in 
the MCMC algorithm ensures a better exploration of the support of the posterior distribution. To 



do this, we compare the performances of the add and delete proposal of Armstrong et al. (20091 
to those given by the data-driven one. 

In a first step, we use the SAEM-MCMC algorithm to calibrate the value of r and r. We obtain 
T* = 0.674 and r* = 0.69. 

In a second step, using this fixed value of r and r, we generate 2 Markov chains of 110 000 
iterations. The first one is simulated using the add and delete kernel. For the second one, we use 
exclusively the add and delete kernel during 10 000 iterations : this phase of bum-in allows a large 
exploration of the decomposable graphs space. During the last 100 000 iterations, we alternatively 
and systematically use the add and delete and data-driven kernels. 

To illustrate the performance of this new kernel, we compute exactly the posterior probabilities 
p{Q\Y; T*,r*) for each decomposable graph of size p = 6. We concentrate our efforts on the 
graphs such that p{Q\Y; t* ,r*) < 0.001 (resulting into 107 graphs among the 18154 ones) as- 
suming the the other ones are of small interest because nearly never reached by the Markov chains. 
For each graph of interest Qint, we count the number of times each Markov Chain reached it (after 
having removed the bumin period). We finally obtain an estimation of the posterior probability by 
each chain: 



7fi(^mt|Y; T*,r*) 
7f2(^mt|Y;T*,r*) 



\{t;g? = gint}\ 

100 000 
100 000 



These values are compared to the theoretical ones Y; r*, r*). In Figure [3j we plot the 

estimated densities of the quantities relative errors 



in solid line, and 



MGint\Y; T*,r*)- p(g,nt|Y; T*,r*) 
p(^mt|Y;r*,r*) 

MOint\Y; T*,r*)- p{g^nt\Y■, T*,r*) 



X 100 



X 100 



p{gint\Y;T*,r*) 
in dashed line. 

We note that the density corresponding to the errors involved by the data-driven kernel is more 
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Figure 3: Fowl bones data set. densities of the relative errors on tiie posterior probabilities for the 
107 most probable graphs, add and delete kernel in solid line and data-driven kernel in dashed 
line. 



concentrate around the value 0. The large errors in the add an delete density are due to the graphs 
with small probabilities. Thus, the new kernel explores more efficiently the posterior distribution. 
The acceptance rate is higher for the data-driven chain (see Figure|4]l. 



5 Conclusion and discussion 



An empirical Bayes strategy estimating prior hyper-parameters in a Gaussian graphical model 
using a SAEM-MCMC algorithm is introduced. 

That proposal does not depend on any calibrating parameters and can be viewed as a default option 
for decomposable graphical model determination. Some empirical studies show the relevance of 
the proposed approach and the good properties of the introduced algorithms. 
However, [Scott and Berger (2010 1 has recently found considerable differences between fully Ba- 
yes and empirical Bayes strategies in the context of variable selection. It would be very interesting 
to investigate, both from theoretical and practical perspectives, on such a discrepancy in the case 
of decomposable graphical model selection. 
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Figure 4: Fowl bones data set: evolution of the acceptance ratio for the add and delete Markov 
chains (solid Une) and the data driven Markov chains (— dashed line). 
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